Project: Differences in pre-processing results if -s (scale) parameter of fastq-mcf is altered
Date: 08 November 2019
Report version: 0.2
Author(s): Kohl Kinning

Linda Crnic Institute for Down Syndrome, University of Colorado Anschutz Medical Campus

The -s (scale) parameter was set to 0 in the pipeline. This parameter is interpreted as log2(s), so the setting of 0 results in adapter trimming after a single base match. We noticed over-aggressive trimming of the ends of reads due to such a low threshold for matching. In this report, we initially look at the differences of several -s settings and then continue to look at the differences between a setting of 0 and 1.6.

Libraries

library(ggplot2)
library(ggpubr)
library(dplyr)
library(plotly)

Difference in total counts

A match of 1 and 2 bases looks similar, while higher settings result in a sharp drop off.

trim_counts <- read.table("~/Projects/CD4_CD8_NK_trimming_investigations/trimming_results.csv", header=TRUE, sep=",")

ggplot(trim_counts, aes(x=as.character(scale), y=clipped_end_reads)) +
  geom_bar(stat="identity") +
  xlab("-s parameter") + ylab("Clipped reads") +
  ggtitle("Effect of -s parameter on read clipping") +
  theme_pubr()

Read the counts data in

Look at the HTSeq resulting gene count data for the setting of 0 and 1.6.

zero_counts <- read.table("~/Projects/CD4_CD8_NK_trimming_investigations/counts/0/CD4_100B2_HTSeq_counts.txt", row.names=1)
one_six_counts <- read.table("~/Projects/CD4_CD8_NK_trimming_investigations/counts/1.6/CD4_100B2_HTSeq_counts.txt", row.names=1)

bound_counts <- zero_counts
names(bound_counts) <- "lower_thresh"
bound_counts$higher_thresh <- one_six_counts$V2
bound_counts$diff <- bound_counts$lower_thresh - bound_counts$higher_thresh
bound_counts$ratio_change <- bound_counts$diff/bound_counts$lower_thresh

Scatter plot

Observe no outliers when comparing the counts between the settings. Any changes are minor and in a consistent direction.

ggplotly(
ggplot(bound_counts, aes(x=log2(lower_thresh), y=log2(higher_thresh), label=rownames(bound_counts))) +
  geom_point(shape=1) +
  xlab("log2(-s 0 resulting counts)") + ylab("log2(-s 1.6 resulting counts)") +
  ggtitle("Difference in resulting counts\ndue to -s parameter") +
  theme_pubr()
)

Total difference in counts

Look at the total difference in HTSeq counts due to the -s parameter change

sum(bound_counts$lower_thresh)-sum(bound_counts$higher_thresh)
[1] 144020

Zero count genes

Number of genes lost completely

colSums(bound_counts[1:2]==0)
 lower_thresh higher_thresh 
         9305          9313 

Average change

Look at the average per-gene counts change, using the absolute differences

mean(abs(bound_counts$diff))
[1] 6.622256

Gene gain

Which genes do we gain with the higher thresh?

gained_genes <-
  setdiff(
    rownames(bound_counts[which(bound_counts$lower_thresh==0),]),
    rownames(bound_counts[which(bound_counts$higher_thresh==0),])
  )

bound_counts[gained_genes,1:3]

Gene loss

Which genes do we lose with the higher thresh?

lost_genes <-
  setdiff(
    rownames(bound_counts[which(bound_counts$higher_thresh==0),]),
    rownames(bound_counts[which(bound_counts$lower_thresh==0),])
  )

bound_counts[lost_genes,1:3]

Highest differences

Look at the genes which show the largest differences in counts

LS0tCnRpdGxlOiAiYHIgcGFyYW1zJHRpdGxlYCIKb3V0cHV0OgogIGh0bWxfbm90ZWJvb2s6IAogICAgdGhlbWU6IGNvc21vCiAgICB0b2M6IHllcwogICAgdG9jX2Zsb2F0OgogICAgICBjb2xsYXBzZWQ6IHllcwplZGl0b3Jfb3B0aW9uczogCiAgY2h1bmtfb3V0cHV0X3R5cGU6IGlubGluZQpwYXJhbXM6CiAgdmVyc2lvbjogMC4yCiAgdGl0bGU6IEZhc3RxLW1jZiAtcyBwYXJhbWV0ZXIgY2hhbmdlCiAgcHJvamVjdDogRGlmZmVyZW5jZXMgaW4gcHJlLXByb2Nlc3NpbmcgcmVzdWx0cyBpZiAtcyAoc2NhbGUpIHBhcmFtZXRlciBvZiBmYXN0cS1tY2YgaXMgYWx0ZXJlZAogIGRhdGU6ICFyIGZvcm1hdChTeXMudGltZSgpLCAnJWQgJUIgJVknKQogIGF1dGhvcjoKICAgIC0gS29obCBLaW5uaW5nCiAgZW1haWw6CiAgICAtIGtvaGwua2lubmluZ0BjdWFuc2NodXR6LmVkdQogIGFmZmlsaWF0aW9uOiBMaW5kYSBDcm5pYyBJbnN0aXR1dGUgZm9yIERvd24gU3luZHJvbWUsIFVuaXZlcnNpdHkgb2YgQ29sb3JhZG8gQW5zY2h1dHogTWVkaWNhbCBDYW1wdXMKLS0tCgoqKioKPGZvbnQgc3R5bGU9ImZvbnQtc2l6ZToxOHB0Ij4gUHJvamVjdDogYHIgcGFyYW1zJHByb2plY3RgIDwvZm9udD48YnI+CkRhdGU6ICBgciBwYXJhbXMkZGF0ZWAgIApSZXBvcnQgdmVyc2lvbjogYHIgcGFyYW1zJHZlcnNpb25gICAgCkF1dGhvcihzKTogYHIgcGFyYW1zJGF1dGhvcmAgIApgciBwYXJhbXMkZW1haWxgICAKYHIgcGFyYW1zJGFmZmlsaWF0aW9uYAoKVGhlIGAtc2AgKHNjYWxlKSBwYXJhbWV0ZXIgd2FzIHNldCB0byBgMGAgaW4gdGhlIHBpcGVsaW5lLiBUaGlzIHBhcmFtZXRlciBpcyBpbnRlcnByZXRlZCBhcyBsb2cyKHMpLCBzbyB0aGUgc2V0dGluZyBvZiBgMGAgcmVzdWx0cyBpbiBhZGFwdGVyIHRyaW1taW5nIGFmdGVyIGEgc2luZ2xlIGJhc2UgbWF0Y2guICBXZSBub3RpY2VkIG92ZXItYWdncmVzc2l2ZSB0cmltbWluZyBvZiB0aGUgZW5kcyBvZiByZWFkcyBkdWUgdG8gc3VjaCBhIGxvdyB0aHJlc2hvbGQgZm9yIG1hdGNoaW5nLiBJbiB0aGlzIHJlcG9ydCwgd2UgaW5pdGlhbGx5IGxvb2sgYXQgdGhlIGRpZmZlcmVuY2VzIG9mIHNldmVyYWwgYC1zYCBzZXR0aW5ncyBhbmQgdGhlbiBjb250aW51ZSB0byBsb29rIGF0IHRoZSBkaWZmZXJlbmNlcyBiZXR3ZWVuIGEgc2V0dGluZyBvZiBgMGAgYW5kIGAxLjZgLgoKIyBMaWJyYXJpZXMKCmBgYHtyfQpsaWJyYXJ5KGdncGxvdDIpCmxpYnJhcnkoZ2dwdWJyKQpsaWJyYXJ5KGRwbHlyKQpsaWJyYXJ5KHBsb3RseSkKYGBgCgojIERpZmZlcmVuY2UgaW4gdG90YWwgY291bnRzIAoKQSBtYXRjaCBvZiAxIGFuZCAyIGJhc2VzIGxvb2tzIHNpbWlsYXIsIHdoaWxlIGhpZ2hlciBzZXR0aW5ncyByZXN1bHQgaW4gYSBzaGFycCBkcm9wIG9mZi4KYGBge3J9CnRyaW1fY291bnRzIDwtIHJlYWQudGFibGUoIn4vUHJvamVjdHMvQ0Q0X0NEOF9OS190cmltbWluZ19pbnZlc3RpZ2F0aW9ucy90cmltbWluZ19yZXN1bHRzLmNzdiIsIGhlYWRlcj1UUlVFLCBzZXA9IiwiKQoKZ2dwbG90KHRyaW1fY291bnRzLCBhZXMoeD1hcy5jaGFyYWN0ZXIoc2NhbGUpLCB5PWNsaXBwZWRfZW5kX3JlYWRzKSkgKwogIGdlb21fYmFyKHN0YXQ9ImlkZW50aXR5IikgKwogIHhsYWIoIi1zIHBhcmFtZXRlciIpICsgeWxhYigiQ2xpcHBlZCByZWFkcyIpICsKICBnZ3RpdGxlKCJFZmZlY3Qgb2YgLXMgcGFyYW1ldGVyIG9uIHJlYWQgY2xpcHBpbmciKSArCiAgdGhlbWVfcHVicigpCmBgYAoKIyBSZWFkIHRoZSBjb3VudHMgZGF0YSBpbgoKTG9vayBhdCB0aGUgSFRTZXEgcmVzdWx0aW5nIGdlbmUgY291bnQgZGF0YSBmb3IgdGhlIHNldHRpbmcgb2YgYDBgIGFuZCBgMS42YC4KYGBge3J9Cnplcm9fY291bnRzIDwtIHJlYWQudGFibGUoIn4vUHJvamVjdHMvQ0Q0X0NEOF9OS190cmltbWluZ19pbnZlc3RpZ2F0aW9ucy9jb3VudHMvMC9DRDRfMTAwQjJfSFRTZXFfY291bnRzLnR4dCIsIHJvdy5uYW1lcz0xKQpvbmVfc2l4X2NvdW50cyA8LSByZWFkLnRhYmxlKCJ+L1Byb2plY3RzL0NENF9DRDhfTktfdHJpbW1pbmdfaW52ZXN0aWdhdGlvbnMvY291bnRzLzEuNi9DRDRfMTAwQjJfSFRTZXFfY291bnRzLnR4dCIsIHJvdy5uYW1lcz0xKQoKYm91bmRfY291bnRzIDwtIHplcm9fY291bnRzCm5hbWVzKGJvdW5kX2NvdW50cykgPC0gImxvd2VyX3RocmVzaCIKYm91bmRfY291bnRzJGhpZ2hlcl90aHJlc2ggPC0gb25lX3NpeF9jb3VudHMkVjIKYm91bmRfY291bnRzJGRpZmYgPC0gYm91bmRfY291bnRzJGxvd2VyX3RocmVzaCAtIGJvdW5kX2NvdW50cyRoaWdoZXJfdGhyZXNoCmJvdW5kX2NvdW50cyRyYXRpb19jaGFuZ2UgPC0gYm91bmRfY291bnRzJGRpZmYvYm91bmRfY291bnRzJGxvd2VyX3RocmVzaApgYGAKCgoKIyBTY2F0dGVyIHBsb3QKCk9ic2VydmUgbm8gb3V0bGllcnMgd2hlbiBjb21wYXJpbmcgdGhlIGNvdW50cyBiZXR3ZWVuIHRoZSBzZXR0aW5ncy4gQW55IGNoYW5nZXMgYXJlIG1pbm9yIGFuZCBpbiBhIGNvbnNpc3RlbnQgZGlyZWN0aW9uLgpgYGB7cn0KZ2dwbG90bHkoCmdncGxvdChib3VuZF9jb3VudHMsIGFlcyh4PWxvZzIobG93ZXJfdGhyZXNoKSwgeT1sb2cyKGhpZ2hlcl90aHJlc2gpLCBsYWJlbD1yb3duYW1lcyhib3VuZF9jb3VudHMpKSkgKwogIGdlb21fcG9pbnQoc2hhcGU9MSkgKwogIHhsYWIoImxvZzIoLXMgMCByZXN1bHRpbmcgY291bnRzKSIpICsgeWxhYigibG9nMigtcyAxLjYgcmVzdWx0aW5nIGNvdW50cykiKSArCiAgZ2d0aXRsZSgiRGlmZmVyZW5jZSBpbiByZXN1bHRpbmcgY291bnRzXG5kdWUgdG8gLXMgcGFyYW1ldGVyIikgKwogIHRoZW1lX3B1YnIoKQopCmBgYAoKCgoKIyBUb3RhbCBkaWZmZXJlbmNlIGluIGNvdW50cwoKTG9vayBhdCB0aGUgdG90YWwgZGlmZmVyZW5jZSBpbiBIVFNlcSBjb3VudHMgZHVlIHRvIHRoZSBgLXNgIHBhcmFtZXRlciBjaGFuZ2UKYGBge3J9CnN1bShib3VuZF9jb3VudHMkbG93ZXJfdGhyZXNoKS1zdW0oYm91bmRfY291bnRzJGhpZ2hlcl90aHJlc2gpCmBgYAoKCiMgWmVybyBjb3VudCBnZW5lcwoKTnVtYmVyIG9mIGdlbmVzIGxvc3QgY29tcGxldGVseQpgYGB7cn0KY29sU3Vtcyhib3VuZF9jb3VudHNbMToyXT09MCkKYGBgCgojIEF2ZXJhZ2UgY2hhbmdlCgpMb29rIGF0IHRoZSBhdmVyYWdlIHBlci1nZW5lIGNvdW50cyBjaGFuZ2UsIHVzaW5nIHRoZSBhYnNvbHV0ZSBkaWZmZXJlbmNlcwpgYGB7cn0KbWVhbihhYnMoYm91bmRfY291bnRzJGRpZmYpKQpgYGAKCgoKIyBHZW5lIGdhaW4KCldoaWNoIGdlbmVzIGRvIHdlIGdhaW4gd2l0aCB0aGUgaGlnaGVyIHRocmVzaD8KYGBge3J9CmdhaW5lZF9nZW5lcyA8LQogIHNldGRpZmYoCiAgICByb3duYW1lcyhib3VuZF9jb3VudHNbd2hpY2goYm91bmRfY291bnRzJGxvd2VyX3RocmVzaD09MCksXSksCiAgICByb3duYW1lcyhib3VuZF9jb3VudHNbd2hpY2goYm91bmRfY291bnRzJGhpZ2hlcl90aHJlc2g9PTApLF0pCiAgKQoKYm91bmRfY291bnRzW2dhaW5lZF9nZW5lcywxOjNdCmBgYAoKCgojIEdlbmUgbG9zcwoKV2hpY2ggZ2VuZXMgZG8gd2UgbG9zZSB3aXRoIHRoZSBoaWdoZXIgdGhyZXNoPwpgYGB7cn0KbG9zdF9nZW5lcyA8LQogIHNldGRpZmYoCiAgICByb3duYW1lcyhib3VuZF9jb3VudHNbd2hpY2goYm91bmRfY291bnRzJGhpZ2hlcl90aHJlc2g9PTApLF0pLAogICAgcm93bmFtZXMoYm91bmRfY291bnRzW3doaWNoKGJvdW5kX2NvdW50cyRsb3dlcl90aHJlc2g9PTApLF0pCiAgKQoKYm91bmRfY291bnRzW2xvc3RfZ2VuZXMsMTozXQpgYGAKCgojIEhpZ2hlc3QgZGlmZmVyZW5jZXMKCkxvb2sgYXQgdGhlIGdlbmVzIHdoaWNoIHNob3cgdGhlIGxhcmdlc3QgZGlmZmVyZW5jZXMgaW4gY291bnRzCmBgYHtyfQpib3VuZF9jb3VudHMgJT4lIHRpYmJsZTo6cm93bmFtZXNfdG9fY29sdW1uKCkgJT4lIHRvcF9uKDUwLCBkaWZmKSAlPiUgYXJyYW5nZSgtZGlmZikKYGBgCgo=